Zerosum Monitoring Data Project
Data Collection and Methodology
Data Collection Tools
This project utilized Kevin Huck’s Zerosum tool for performance data collection.
Zerosum efficiently captures system metrics by directly reading /proc files and interfacing with native performance collectors. Since these metrics are already maintained by the kernel, Zerosum introduces negligible overhead, ensuring that measurements accurately represent system behavior without distortion from the collection process itself.
The collected metrics encompass:
- Light Weight Process (LWP) statistics
- Hardware Thread (HWT) utilization metrics
- Memory allocation and throughput measurements
- GPU performance indicators
- Power consumption readings
- Temperature measurements
Data sampling occurred at regular 10-second intervals, providing consistent time-series observations throughout each workload execution.
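As an illustration of why these reads are so cheap, the kernel exposes per-HWT counters as plain text. The sketch below (a Python illustration, not Zerosum's actual implementation) parses a `/proc/stat`-style CPU line into its jiffy counters:

```python
# Illustrative parser for one 'cpuN ...' line in /proc/stat format.
# The kernel already maintains these counters, so sampling them adds
# negligible overhead to the monitored system.

def parse_cpu_line(line):
    """Parse one 'cpuN ...' line into a dict of jiffy counters."""
    fields = line.split()
    names = ["user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal", "guest", "guest_nice"]
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))

# Example line in /proc/stat format (values are illustrative)
sample = "cpu0 4705 356 584 3699176 23060 0 277 0 0 0"
counters = parse_cpu_line(sample)
total = sum(counters.values())
idle_pct = 100 * (counters["idle"] + counters["iowait"]) / total
```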
Experimental Setup
Data for this study was collected from two distinct computing environments:
Thinkpad Workstation Setup
My personal Thinkpad workstation was used for generating controlled, labeled workload patterns. The relevant hardware specifications are summarized below:
| Component | Specification |
|---|---|
| Processor | 13th Gen Intel Core i7-1365U |
| Architecture | x86_64 |
| Physical Cores | 10 cores (Core(s) per socket) |
| Logical Processors | 12 CPUs (Thread(s) per core: 2) |
| CPU Max Frequency | 5.2 GHz |
| CPU Min Frequency | 400 MHz |
| L1 Cache | 352 KiB L1d (10 instances), 576 KiB L1i (10 instances) |
| L2 Cache | 6.5 MiB (4 instances) |
| L3 Cache | 12 MiB (1 instance) |
| NUMA Configuration | Single NUMA node |
I implemented five distinct multithreaded programs:
| Workload Type | Description | Configuration |
|---|---|---|
| CPU-bound | Intensive prime number calculations | Max available cores |
| Memory-bound | Allocated and manipulated 1000 MB of data | Max available cores |
| I/O-bound | Sequential disk operations with 10MB chunk writes | Max available cores |
| Deadlocked | Deliberate resource contention with mutex deadlocks | Max available cores |
| Unbalanced | Randomized resource utilization spikes | Max available cores |
To minimize confounding environmental factors, each program was executed in isolation using all available cores. Background processes and system services were limited to the extent possible. Each workload type was run 10 times with consistent 100-second execution intervals.
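The workload programs themselves are not reproduced here; as a hedged illustration, the core logic of the CPU-bound workload might resemble the following Python analog (the actual programs were separate multithreaded binaries):

```python
# Illustrative sketch of a CPU-bound prime-calculation workload that
# saturates all logical processors, mirroring the "max available cores"
# configuration. Not the actual workload script.
from multiprocessing import Pool, cpu_count

def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(limit):
    """Count primes below limit (pure CPU work, no I/O)."""
    return sum(1 for k in range(limit) if is_prime(k))

if __name__ == "__main__":
    # One worker per logical processor, each grinding on identical work
    with Pool(cpu_count()) as pool:
        results = pool.map(count_primes, [20_000] * cpu_count())
```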
Frontier Supercomputer Collection
The Frontier supercomputer provided real-world high-performance computing workload data.
On Frontier, Kevin executed XGC—a Multiphysics Magnetic Fusion Reactor Simulator designed to model plasma dynamics from the hot core to cold wall regions. This application was deployed across 64 compute nodes, utilizing all available CPU cores and GPU resources.
The Frontier dataset provided additional dimensions beyond those captured on the Thinkpad, including:
- Detailed GPU performance metrics
- RAPL (Running Average Power Limit) energy and power properties
These additional metrics presented valuable opportunities for unsupervised learning exploration.
Dataset Limitations and Adjustments
Initial Approach and Challenges
Initially, I planned to analyze monitoring data from four nodes available in the Zerosum repository. However, this approach proved insufficient for meaningful application state characterization and prediction.
Several limitations influenced the dataset design:
- Lack of Light Weight Process (LWP) information (the experimental setup focused on thread-level rather than process-level scheduling)
- Memory metrics aggregated across all HWTs and LWPs, complicating per-thread memory analysis
Revised Observation Model
To address these constraints, I redesigned the observation model:
- Individual Hardware Threads (HWTs) became the primary analytical unit
- Each HWT was associated with its specific performance metrics
- Corresponding memory properties from the execution run were linked to each HWT
- LWP metrics were excluded from the final dataset
These constraints did not apply to the Frontier dataset, where each LWP served as an observation and no information was excluded.
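The step of linking run-level memory properties to each HWT can be sketched as a join on the run identifier (column names here are assumed for illustration, not the exact pipeline schema):

```python
# Sketch: broadcast each run's memory statistics onto every HWT
# observation in that run via a left join on run_id.
import pandas as pd

hwt = pd.DataFrame({
    "run_id": [1, 1, 2],
    "hwt_id": [0, 1, 0],
    "user_time_percentage": [91.0, 88.5, 12.0],
})
mem = pd.DataFrame({
    "run_id": [1, 2],
    "mem_available_std": [350.0, 40.0],
})
# Every HWT row inherits its run's memory properties
observations = hwt.merge(mem, on="run_id", how="left")
```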
Feature Engineering Approach
The time-stepped nature of the performance data presented analytical challenges, particularly given the limited coverage of time-series analysis within the curriculum.
Rather than attempting to apply complex time-series methodologies, I implemented feature engineering techniques to transform temporal data into classification-compatible formats:
- Initial approach: Each time step treated as an independent observation
- Resulted in poor prediction performance (results not included in analysis)
- Revised approach: Extracted features from each HWT in the laptop dataset
- Mean values
- Standard deviation
- Rate-of-change metrics (delta terms between consecutive time steps)
These engineered features provided a more interpretable representation of the performance data while preserving the essential temporal patterns necessary for effective workload classification. Memory properties were associated with each HWT to maintain analytical coherence across the dataset.
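The feature extraction described above can be sketched as follows (function and feature names are illustrative, not the exact pipeline code):

```python
# Collapse one metric's time series for a single HWT into the summary
# features described above: mean, standard deviation, and delta
# (rate-of-change) statistics between consecutive samples.
import numpy as np

def engineer_features(series):
    """Turn one metric's time series into summary features."""
    values = np.asarray(series, dtype=float)
    deltas = np.diff(values)  # change between consecutive 10 s samples
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),
        "delta_avg": deltas.mean(),
        "delta_std": deltas.std(ddof=1),
    }

# Example: user_time samples for one hardware thread
feats = engineer_features([10.0, 12.0, 15.0, 11.0, 14.0])
```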
What are the distinguishing statistical properties (means, variances, distributions) of system metrics across different workload types?
To validate and visualize the discriminative ability of the derived features, I averaged each metric across workload categories and created comparative bar graphs. These visualizations demonstrate how the derived metrics capture fundamental differences between workload types; the graphs shown are those I found most interesting, with the most pronounced differences between workloads.
What configurations lead to high runtime variability or inefficient resource mapping?
What statistical methods can be used for quick descriptive analysis?
Code
#Calculate mean for each metric and workload
summary_data = {}
workload_types = agg_features_df['workload_type'].unique()
for workload in workload_types:
summary_data[workload] = {}
for metric in metric_names:
metric_data = agg_features_df[(agg_features_df['workload_type'] == workload) &
(agg_features_df['name'] == metric)]
if not metric_data.empty:
summary_data[workload][metric] = metric_data['value'].mean()
# Normalize values to 0-1 scale for comparison
for metric in metric_names:
all_values = [data.get(metric, 0) for data in summary_data.values()]
min_val = min(all_values)
max_val = max(all_values)
if max_val > min_val:
for workload in workload_types:
if metric in summary_data[workload]:
summary_data[workload][metric] = (summary_data[workload][metric] - min_val) / (max_val - min_val)
# Create radar chart
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, polar=True)
# Set the angles for each metric
angles = np.linspace(0, 2*np.pi, len(metric_names), endpoint=False).tolist()
angles += angles[:1] # Close the loop
# Plot each workload
for i, workload in enumerate(workload_types):
values = [summary_data[workload].get(metric, 0) for metric in metric_names]
values += values[:1] # Close the loop
ax.plot(angles, values, linewidth=2, linestyle='solid', label=workload)
ax.fill(angles, values, alpha=0.1)
# Set labels and title
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metric_names)
ax.set_title('Workload Type Comparison', fontsize=15, pad=25)
ax.legend(loc='upper right')
plt.tight_layout()

An analysis of the spider plot reveals distinct workload characteristics across various metrics. The visualization effectively delineates expected patterns in resource utilization profiles. For example, unbalanced workloads demonstrate significant temporal variation, manifested through elevated standard deviations and pronounced differentials between consecutive time intervals.
CPU-bound workloads exhibit high coefficients of variation, correlating with the substantial peaks observed in user_time percentage measurements. This variance is unexpected, as it indicates potential inefficiencies in execution patterns. The elevated coefficient of variation values across multiple dimensions suggest underlying systemic issues affecting computational consistency, which was not the intention when executing this workload.
Given that the workload consists exclusively of homogeneous prime number calculations across all threads, the observed variability is unlikely attributable to non-uniform task distribution. Rather, the data more strongly supports resource contention as the primary causal factor. The competition among identical computational processes for shared system resources presents the most probable explanation for the performance irregularities documented in the metrics.
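The coefficient of variation referenced above is simply the standard deviation normalized by the mean; a small illustrative helper (not the project's actual computation) shows how an elevated value flags inconsistent execution:

```python
# Coefficient of variation: scale-free measure of dispersion, useful for
# comparing variability across metrics with different magnitudes.
import numpy as np

def coefficient_of_variation(values):
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    return values.std(ddof=1) / mean if mean else np.nan

steady = coefficient_of_variation([98, 99, 97, 98])   # consistent user_time
bursty = coefficient_of_variation([30, 95, 10, 99])   # contended user_time
```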
Can we predict workload types using Zerosum data?
Code
label_encoder = LabelEncoder()
y_data_encoded = label_encoder.fit_transform(y)

Index(['hostname', 'rank', 'shmrank', 'step', 'resource', 'type', 'index',
'name', 'value', 'workload_type', 'run_id'],
dtype='object')
Feature matrix shape: (720, 22)
Feature columns: Index(['active_memory_percentage', 'idle_time_percentage', 'idle_time_std',
'iowait_percentage', 'iowait_time_delta_avg', 'iowait_time_delta_std',
'iowait_time_std', 'mem_available_cv', 'mem_available_delta_avg',
'mem_available_delta_std', 'mem_available_std', 'mem_free_std',
'memory_fragmentation', 'memory_utilization', 'system_time_delta_avg',
'system_time_delta_std', 'system_time_percentage', 'system_time_std',
'user_time_delta_avg', 'user_time_delta_std', 'user_time_percentage',
'user_time_std'],
dtype='object', name='name')
Label distribution:
workload_type
io_bound 120
deadlock 120
error 120
mem_bound 120
cpu_bound 120
unbalanced 120
Name: count, dtype: int64
What is the minimum set of performance metrics needed to accurately classify workload types with high accuracy?
Code
# Create a new dataset with only the top 10 features
X_top10 = X[top_features_overall]
X_top10_normalized = scaler.fit_transform(X_top10)
# For training and testing, use the same splits but with reduced features
X_train_top10 = X_train[top_features_overall]
X_test_top10 = X_test[top_features_overall]
# Scale the reduced feature set
X_train_top10_scaled = scaler.fit_transform(X_train_top10)
X_test_top10_scaled = scaler.transform(X_test_top10)
# Train a new model with only the top features
top10_model = LogisticRegression(
random_state=random_state,
max_iter=max_iter,
solver='lbfgs',
C=1.0
)
# Train the model
top10_model.fit(X_train_top10_scaled, y_train)
# Make predictions
y_pred_top10 = top10_model.predict(X_test_top10_scaled)
# Evaluate performance
accuracy_top10 = accuracy_score(y_test, y_pred_top10)
print(f"\nTest Accuracy with top {num_of_features} features: {accuracy_top10:.4f}")
print(f"\nClassification Report (Top {num_of_features} Features):")
print(classification_report(y_test, y_pred_top10))
# Compare with full model
print(f"\nFull model accuracy: {accuracy:.4f}")
print(f"Top {num_of_features} features model accuracy: {accuracy_top10:.4f}")
top_features = X_train_top10.columns.tolist()
print(top_features)
Test Accuracy with top 19 features: 0.9375
Classification Report (Top 19 Features):
precision recall f1-score support
cpu_bound 0.80 0.83 0.82 24
deadlock 1.00 1.00 1.00 24
error 0.83 0.79 0.81 24
io_bound 1.00 1.00 1.00 24
mem_bound 1.00 1.00 1.00 24
unbalanced 1.00 1.00 1.00 24
accuracy 0.94 144
macro avg 0.94 0.94 0.94 144
weighted avg 0.94 0.94 0.94 144
Full model accuracy: 0.9375
Top 19 features model accuracy: 0.9375
['mem_available_std', 'mem_available_cv', 'mem_available_delta_std', 'mem_available_delta_avg', 'mem_free_std', 'iowait_percentage', 'active_memory_percentage', 'system_time_delta_std', 'memory_utilization', 'memory_fragmentation', 'user_time_delta_avg', 'system_time_percentage', 'system_time_std', 'user_time_std', 'iowait_time_std', 'idle_time_std', 'user_time_delta_std', 'system_time_delta_avg', 'iowait_time_delta_std']
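The derivation of `top_features_overall` is not shown in this excerpt; one plausible approach (an assumption, not necessarily the method used here) ranks features by mean absolute coefficient magnitude from a fitted logistic regression, sketched below on synthetic data:

```python
# Assumed sketch: rank features by mean |coefficient| across the rows of
# a fitted LogisticRegression's coefficient matrix. Synthetic stand-in
# data replaces the report's engineered feature matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=500).fit(X_scaled, y)

# Mean absolute weight per feature across the coefficient rows
importance = np.abs(clf.coef_).mean(axis=0)
ranked = np.argsort(importance)[::-1]
top_features_overall = ranked[:4].tolist()  # indices of the top 4 features
```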
Code
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
# Create QDA model
qda_model = QuadraticDiscriminantAnalysis(reg_param=1)
# Train QDA with top 10 features
qda_model.fit(X_train_top10_scaled, y_train)
# Make predictions with QDA
y_pred_qda = qda_model.predict(X_test_top10_scaled)
# Evaluate QDA performance
accuracy_qda = accuracy_score(y_test, y_pred_qda)
print(f"\nQDA Test Accuracy with top {num_of_features} features: {accuracy_qda:.4f}")
print(f"\nQDA Classification Report (Top {num_of_features} Features):")
print(classification_report(y_test, y_pred_qda))
# Create LDA model
lda_model = LinearDiscriminantAnalysis()
# Train LDA with top 10 features
lda_model.fit(X_train_top10_scaled, y_train)
# Make predictions with LDA
y_pred_lda = lda_model.predict(X_test_top10_scaled)
# Evaluate LDA performance
accuracy_lda = accuracy_score(y_test, y_pred_lda)
print(f"\nLDA Test Accuracy with top {num_of_features} features: {accuracy_lda:.4f}")
print(f"\nLDA Classification Report (Top {num_of_features} Features):")
print(classification_report(y_test, y_pred_lda))
print("\n--- Model Comparison ---")
print(f"Logistic Regression accuracy: {accuracy_top10:.4f}")
print(f"QDA accuracy: {accuracy_qda:.4f}")
print(f"LDA accuracy: {accuracy_lda:.4f}")
cv_scores_qda = cross_val_score(
qda_model, X_top10_normalized, y, groups=groups,
cv=group_kfold, scoring='accuracy'
)
print(f"\nQDA Cross-Validation Accuracy: {cv_scores_qda.mean():.4f} ± {cv_scores_qda.std():.4f}")
cv_scores_lda = cross_val_score(
lda_model, X_top10_normalized, y, groups=groups,
cv=group_kfold, scoring='accuracy'
)
print(f"\nLDA Cross-Validation Accuracy: {cv_scores_lda.mean():.4f} ± {cv_scores_lda.std():.4f}")
# fig, axes = plt.subplots(1, 2, figsize=(16, 7))
from sklearn.naive_bayes import GaussianNB
# Create Naive Bayes model
nb_model = GaussianNB()
# Train Naive Bayes with top 10 features
nb_model.fit(X_train_top10_scaled, y_train)
# Make predictions with Naive Bayes
y_pred_nb = nb_model.predict(X_test_top10_scaled)
# Evaluate Naive Bayes performance
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"\nNaive Bayes Test Accuracy with top 10 features: {accuracy_nb:.4f}")
print("\nNaive Bayes Classification Report (Top 10 Features):")
print(classification_report(y_test, y_pred_nb))
# Update model comparison
print("\n--- Model Comparison ---")
print(f"Logistic Regression accuracy: {accuracy_top10:.4f}")
print(f"QDA accuracy: {accuracy_qda:.4f}")
print(f"LDA accuracy: {accuracy_lda:.4f}")
print(f"Naive Bayes accuracy: {accuracy_nb:.4f}")
# Cross-validation for Naive Bayes
cv_scores_nb = cross_val_score(
nb_model, X_top10_normalized, y, groups=groups,
cv=group_kfold, scoring='accuracy'
)
cv_scores_lda = cross_val_score(
lda_model, X_top10_normalized, y, groups=groups,
cv=group_kfold, scoring='accuracy'
)
cv_scores_qda = cross_val_score(
qda_model, X_top10_normalized, y, groups=groups,
cv=group_kfold, scoring='accuracy'
)
print(f"\nNaive Bayes Cross-Validation Accuracy: {cv_scores_nb.mean():.4f} ± {cv_scores_nb.std():.4f}")
print(f"\nLDA Cross-Validation Accuracy: {cv_scores_lda.mean():.4f} ± {cv_scores_lda.std():.4f}")
print(f"\nQDA Cross-Validation Accuracy: {cv_scores_qda.mean():.4f} ± {cv_scores_qda.std():.4f}")
fig, axes = plt.subplots(1, 3, figsize=(24, 7))
# QDA confusion matrix
cm_qda = confusion_matrix(y_test, y_pred_qda)
sns.heatmap(
cm_qda, annot=True, fmt='d', cmap='Blues',
xticklabels=qda_model.classes_,
yticklabels=qda_model.classes_,
ax=axes[0]
)
axes[0].set_title('QDA Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')
# LDA confusion matrix
cm_lda = confusion_matrix(y_test, y_pred_lda)
sns.heatmap(
cm_lda, annot=True, fmt='d', cmap='Blues',
xticklabels=lda_model.classes_,
yticklabels=lda_model.classes_,
ax=axes[1]
)
axes[1].set_title('LDA Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')
# Naive Bayes confusion matrix
cm_nb = confusion_matrix(y_test, y_pred_nb)
sns.heatmap(
cm_nb, annot=True, fmt='d', cmap='Blues',
xticklabels=nb_model.classes_,
yticklabels=nb_model.classes_,
ax=axes[2]
)
axes[2].set_title('Naive Bayes Confusion Matrix')
axes[2].set_xlabel('Predicted Label')
axes[2].set_ylabel('True Label')
plt.tight_layout()
plt.show()
QDA Test Accuracy with top 19 features: 0.8889
QDA Classification Report (Top 19 Features):
precision recall f1-score support
cpu_bound 1.00 0.33 0.50 24
deadlock 1.00 1.00 1.00 24
error 0.60 1.00 0.75 24
io_bound 1.00 1.00 1.00 24
mem_bound 1.00 1.00 1.00 24
unbalanced 1.00 1.00 1.00 24
accuracy 0.89 144
macro avg 0.93 0.89 0.88 144
weighted avg 0.93 0.89 0.88 144
LDA Test Accuracy with top 19 features: 0.8819
LDA Classification Report (Top 19 Features):
precision recall f1-score support
cpu_bound 0.67 0.58 0.62 24
deadlock 1.00 1.00 1.00 24
error 0.63 0.71 0.67 24
io_bound 1.00 1.00 1.00 24
mem_bound 1.00 1.00 1.00 24
unbalanced 1.00 1.00 1.00 24
accuracy 0.88 144
macro avg 0.88 0.88 0.88 144
weighted avg 0.88 0.88 0.88 144
--- Model Comparison ---
Logistic Regression accuracy: 0.9375
QDA accuracy: 0.8889
LDA accuracy: 0.8819
QDA Cross-Validation Accuracy: 0.8361 ± 0.0436
LDA Cross-Validation Accuracy: 0.8542 ± 0.0837
Naive Bayes Test Accuracy with top 10 features: 0.8611
Naive Bayes Classification Report (Top 10 Features):
precision recall f1-score support
cpu_bound 0.80 0.67 0.73 24
deadlock 0.75 1.00 0.86 24
error 0.60 0.50 0.55 24
io_bound 1.00 1.00 1.00 24
mem_bound 1.00 1.00 1.00 24
unbalanced 1.00 1.00 1.00 24
accuracy 0.86 144
macro avg 0.86 0.86 0.85 144
weighted avg 0.86 0.86 0.85 144
--- Model Comparison ---
Logistic Regression accuracy: 0.9375
QDA accuracy: 0.8889
LDA accuracy: 0.8819
Naive Bayes accuracy: 0.8611
Naive Bayes Cross-Validation Accuracy: 0.9028 ± 0.0534
LDA Cross-Validation Accuracy: 0.8542 ± 0.0837
QDA Cross-Validation Accuracy: 0.8361 ± 0.0436
[Figures: logistic regression coefficient table (coefficient, std. error, z-statistic, p-value); ROC curves for logistic regression, LDA, and QDA]
Can specific quantitative thresholds in key metrics reliably differentiate workloads?
Code
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Create and train the tree
tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(X_train_top10_scaled, y_train)
# Accuracy
tree_pred = tree_model.predict(X_test_top10_scaled)
tree_accuracy = accuracy_score(y_test, tree_pred)
print(f"Decision Tree Accuracy: {tree_accuracy:.4f}")
# # Create tree visualization
plt.figure(figsize=(20, 12))
plot_tree(tree_model, feature_names=top_features, class_names=list(tree_model.classes_),
filled=True, rounded=True, fontsize=10, proportion=True)
plt.title(f"Decision Tree for Workload Classification (Accuracy: {tree_accuracy:.4f})")
plt.tight_layout()
# plt.savefig("decision_tree_thresholds.png", dpi=300, bbox_inches='tight')
plt.show()

Decision Tree Accuracy: 0.8333
Code
# Select top 6 features based on feature importance from tree
minimal_features = top_features
importances = tree_model.feature_importances_
feature_importance = pd.DataFrame({'Feature': minimal_features, 'Importance': importances})
top_features = feature_importance.sort_values('Importance', ascending=False)['Feature'].head(6).tolist()
# Create histogram for each feature
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
# Debug print
print("Top features:", top_features)
# Create a combined dataframe with features and target
plot_data = pd.DataFrame(X)
plot_data['workload_type'] = y
for i, feature in enumerate(top_features):
ax = axes[i]
# Plot histogram for this feature by workload type
sns.histplot(data=plot_data, x=feature, hue='workload_type', bins=15,
kde=True, element='step', common_norm=False, ax=ax)
# Add decision tree threshold lines if this feature appears in the tree
for node_id in range(tree_model.tree_.node_count):
if tree_model.tree_.feature[node_id] == list(minimal_features).index(feature):
threshold = tree_model.tree_.threshold[node_id]
ax.axvline(x=threshold, color='black', linestyle='--', linewidth=2)
ax.text(threshold, ax.get_ylim()[1]*0.9, f'Threshold: {threshold:.2f}',
rotation=90, verticalalignment='top')
ax.set_title(f'Distribution of {feature} by Workload Type')
ax.set_xlabel(feature)
ax.set_ylabel('Count')
# Calculate means for each workload type
means = plot_data.groupby('workload_type')[feature].mean().to_dict()
y_pos = ax.get_ylim()[1] * 0.8
for j, (wl, mean_val) in enumerate(means.items()):
ax.axvline(x=mean_val, color=sns.color_palette()[j], linestyle=':')
ax.text(mean_val, y_pos - j*ax.get_ylim()[1]*0.05,
f'{wl} mean: {mean_val:.2f}', ha='center', fontsize=8)
plt.tight_layout()
plt.savefig('feature_distributions_thresholds.png', dpi=300, bbox_inches='tight')
plt.show()

Top features: ['mem_available_std', 'mem_available_delta_std', 'mem_available_delta_avg', 'idle_time_std', 'mem_available_cv', 'iowait_percentage']
Looking at these histogram distributions, we can see clear patterns that help explain why the decision tree classifier and logistic regression achieved such high accuracy.
The memory metrics show the most dramatic separations. For mem_available_delta_std, unbalanced workloads stand out with a unique bimodal pattern, while memory-bound workloads show consistently higher values than CPU and I/O workloads. The mem_available_delta_avg plot clearly isolates memory-bound workloads with their strongly negative values, showing they consistently consume memory over time.
For the time-related metrics, idle_time_std gives us a useful signal for CPU-bound workloads, which show more variability in their idle patterns. The idle_time_percentage is particularly good at identifying deadlock conditions, which hover near 99% idle time – even a 1-2% difference here is significant for classification.
The active_memory_percentage chart shows an interesting multi-modal distribution that naturally separates different workload types: I/O-bound workloads cluster around 55%, memory-bound around 54.5%, and CPU-bound near 53%. These seemingly small percentage differences provide reliable boundaries for the decision tree to split on.
Model Discussion
Of the three models (logistic regression, LDA, and QDA), logistic regression performed the best. Furthermore, only a small subset of features was necessary to reach roughly 93% classification accuracy and high F1 scores, with statistical significance confirmed by a chi-square test.
I expected Naive Bayes to perform well because I had many features and relatively few observations. I would guess that QDA would perform better with more observations, and that LDA underperformed because of the large number of features.
I was not expecting the models to perform nearly as well as they did. Another thing I wanted to explore was my derived features: I wanted to make sure that multiple HWT observations sharing the same memory statistics did not make the data trivially classifiable. The decision tree shows that the memory statistics were highly indicative of workload type, and the chi-squared test confirmed that the model results were not due to chance.
A limitation of this model is that the dataset scripts did not include mixed workloads; each script was deliberately designed to exhibit its workload type very clearly. I would like to apply this model to monitoring data from an application of unknown workload.
Code
def extract_tree_rules(tree, feature_names, class_names):
"""Extract rules from a decision tree."""
tree_ = tree.tree_
rules = []
def recurse(node, depth, path):
if tree_.feature[node] != -2: # Not a leaf node
name = feature_names[tree_.feature[node]]
threshold = tree_.threshold[node]
# Left branch - feature <= threshold
path_left = path.copy()
path_left.append(f"{name} <= {threshold:.2f}")
recurse(tree_.children_left[node], depth + 1, path_left)
# Right branch - feature > threshold
path_right = path.copy()
path_right.append(f"{name} > {threshold:.2f}")
recurse(tree_.children_right[node], depth + 1, path_right)
else: # Leaf node
class_probabilities = tree_.value[node][0] / tree_.value[node][0].sum()
predicted_class = class_names[np.argmax(class_probabilities)]
max_prob = np.max(class_probabilities)
if max_prob > 0.7: # Only include rules with high confidence
rule = {
'path': path.copy(),
'predicted_class': predicted_class,
'confidence': max_prob,
'samples': tree_.n_node_samples[node]
}
rules.append(rule)
recurse(0, 1, [])
return rules
# Extract key rules
tree_rules = extract_tree_rules(tree_model, minimal_features, list(tree_model.classes_))
# Display key thresholds
print("\nKey Thresholds for Workload Classification:")
for i, rule in enumerate(sorted(tree_rules, key=lambda x: x['confidence'], reverse=True)[:10]):
path_str = " AND ".join(rule['path'])
print(f"{i+1}. IF {path_str} THEN {rule['predicted_class']} (confidence: {rule['confidence']:.2f}, samples: {rule['samples']})")
# Create a comprehensive threshold table
threshold_counts = {}
for feature in minimal_features:
threshold_counts[feature] = 0
for node_id in range(tree_model.tree_.node_count):
feature_idx = tree_model.tree_.feature[node_id]
if feature_idx != -2: # Not a leaf node
feature = minimal_features[feature_idx]
threshold_counts[feature] += 1
# Show feature usage in decision rules
feature_usage = pd.DataFrame({
'Feature': list(threshold_counts.keys()),
'Times Used as Threshold': list(threshold_counts.values())
})
print("\nFeature Usage in Decision Thresholds:")
print(feature_usage.sort_values('Times Used as Threshold', ascending=False).head(10))
Key Thresholds for Workload Classification:
1. IF mem_available_std <= 1.62 AND mem_available_delta_std <= -0.37 AND mem_available_delta_avg <= 0.47 AND idle_time_std <= -0.26 THEN error (confidence: 1.00, samples: 96)
2. IF mem_available_std <= 1.62 AND mem_available_delta_std <= -0.37 AND mem_available_delta_avg <= 0.47 AND idle_time_std > -0.26 THEN cpu_bound (confidence: 1.00, samples: 12)
3. IF mem_available_std <= 1.62 AND mem_available_delta_std <= -0.37 AND mem_available_delta_avg > 0.47 AND mem_available_delta_std <= -0.65 THEN cpu_bound (confidence: 1.00, samples: 84)
4. IF mem_available_std <= 1.62 AND mem_available_delta_std <= -0.37 AND mem_available_delta_avg > 0.47 AND mem_available_delta_std > -0.65 THEN deadlock (confidence: 1.00, samples: 96)
5. IF mem_available_std <= 1.62 AND mem_available_delta_std > -0.37 AND mem_available_std <= -0.19 THEN io_bound (confidence: 1.00, samples: 96)
6. IF mem_available_std <= 1.62 AND mem_available_delta_std > -0.37 AND mem_available_std > -0.19 THEN unbalanced (confidence: 1.00, samples: 96)
7. IF mem_available_std > 1.62 THEN mem_bound (confidence: 1.00, samples: 96)
Feature Usage in Decision Thresholds:
Feature Times Used as Threshold
0 mem_available_std 2
2 mem_available_delta_std 2
3 mem_available_delta_avg 1
15 idle_time_std 1
1 mem_available_cv 0
5 iowait_percentage 0
4 mem_free_std 0
7 system_time_delta_std 0
8 memory_utilization 0
9 memory_fragmentation 0
ZEROSUM -> FRONTIER XGC DATA
Code
attr = pd.DataFrame(data_dfs[0])
print(attr['name'].unique().tolist())

['majflt', 'minflt', 'nonvoluntary_ctxt_switches', 'nswap', 'processor', 'pthread lock calls', 'pthread trylock calls', 'state', 'step', 'stime', 'utime', 'voluntary_ctxt_switches', 'guest', 'guest_nice', 'idle', 'idle_all', 'iowait', 'irq', 'nice', 'softirq', 'steal', 'system', 'system_all', 'total_time', 'user', 'virt_all_time', 'MemAvailable kB', 'MemFree kB', 'MemTotal kB', 'cray_pm accel0_energy (J)', 'cray_pm accel0_power (W)', 'cray_pm accel1_energy (J)', 'cray_pm accel1_power (W)', 'cray_pm accel2_energy (J)', 'cray_pm accel2_power (W)', 'cray_pm accel3_energy (J)', 'cray_pm accel3_power (W)', 'cray_pm cpu0_temp (C)', 'cray_pm cpu_energy (J)', 'cray_pm cpu_power (W)', 'cray_pm energy (J)', 'cray_pm memory_energy (J)', 'cray_pm memory_power (W)', 'cray_pm power (W)', 'cray_pm power cap changed', 'cray_pm valid', 'Bus ID', 'Can Map Host Memory', 'Clock Instruction Rate', 'Clock Rate', 'Compute Mode', 'Concurrent Kernels', 'GPU_ID', 'Is Multi GPU Board', 'L2 Cache Size', 'Major Compute', 'Max Grid Size', 'Max Shared Memory per Multi Processor', 'Max Threads Dim', 'Max Threads per Block', 'Max Threads per Multi Processor', 'Memory Bus Width', 'Memory Clock Rate', 'Minor Compute', 'Multi Processor Count', 'Name', 'PCI Bus ID', 'PCI Device ID', 'RT_GPU_ID', 'Registers per Block', 'Shared Memory per Block', 'Total Const Memory', 'Total Global Memory', 'Warp Size', 'Clock Frequency, GLX (MHz)', 'Clock Frequency, SOC (MHz)', 'Device Busy %', 'Energy Average (J)', 'GFX Activity', 'GFX Activity %', 'Memory Activity %', 'Memory Busy %', 'Memory Controller Activity', 'Throttle Status', 'Total GTT Bytes', 'Total VRAM Bytes', 'Total Visible VRAM Bytes', 'UVD|VCN Activity', 'Used GTT Bytes', 'Used VRAM Bytes', 'Used Visible VRAM Bytes']
Can unsupervised learning on the Frontier dataset reveal useful utilization insights on applications like XGC?
Index(['hostname', 'rank', 'shmrank', 'step', 'resource', 'type', 'index',
'name', 'value'],
dtype='object')
Feature matrix shape: (1088, 67)
Feature columns: Index(['accel0_power', 'accel0_power_std', 'accel1_power', 'accel1_power_std',
'accel2_power', 'accel2_power_std', 'accel3_power', 'accel3_power_std',
'active_memory_percentage', 'cpu_power', 'cpu_power_std', 'cpu_temp',
'cpu_temp_std', 'gpu_busy_delta_avg', 'gpu_busy_delta_std',
'gpu_busy_percentage', 'gpu_busy_std', 'gpu_clock_glx_std',
'gpu_clock_soc_std', 'gpu_gtt_utilization', 'gpu_mem_busy_percentage',
'gpu_mem_busy_std', 'gpu_used_vram_std', 'gpu_vram_utilization',
'idle_time_percentage', 'idle_time_std', 'iowait_percentage',
'iowait_time_delta_avg', 'iowait_time_delta_std', 'iowait_time_std',
'lwp_majflt', 'lwp_majflt_std', 'lwp_minflt', 'lwp_minflt_std',
'lwp_nonvol_ctx', 'lwp_nonvol_ctx_std', 'lwp_stime',
'lwp_stime_delta_avg', 'lwp_stime_delta_std', 'lwp_stime_std',
'lwp_utime', 'lwp_utime_delta_avg', 'lwp_utime_delta_std',
'lwp_utime_std', 'lwp_vol_ctx', 'lwp_vol_ctx_std',
'mem_available_delta_avg', 'mem_available_delta_std',
'mem_available_std', 'mem_free_std', 'mem_total_std',
'memory_fragmentation', 'memory_power', 'memory_power_std',
'memory_utilization', 'system_time_delta_avg', 'system_time_delta_std',
'system_time_percentage', 'system_time_std', 'total_power',
'total_power_delta_avg', 'total_power_delta_std', 'total_power_std',
'user_time_delta_avg', 'user_time_delta_std', 'user_time_percentage',
'user_time_std'],
dtype='object', name='name')
Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
def get_combined_colormap(n_colors):
"""Generate a large number of distinct colors by combining multiple colormaps"""
# Collect colors from various colormaps
tab10_colors = plt.colormaps['tab10'](np.linspace(0, 1, 10))
tab20_colors = plt.colormaps['tab20'](np.linspace(0, 1, 20))
tab20b_colors = plt.colormaps['tab20b'](np.linspace(0, 1, 20))
tab20c_colors = plt.colormaps['tab20c'](np.linspace(0, 1, 20))
# Combine all colors
all_colors = np.vstack([tab10_colors, tab20_colors, tab20b_colors, tab20c_colors])
    # Cycle through the combined palette if more colors are requested than available
    return all_colors[np.arange(n_colors) % len(all_colors)]
# Group related metrics for better visualization
def create_metric_groups(feature_columns):
power_metrics = [col for col in feature_columns if 'power' in col]
memory_metrics = [col for col in feature_columns if any(x in col for x in ['memory', 'mem'])]
cpu_metrics = [col for col in feature_columns if 'cpu' in col]
gpu_metrics = [col for col in feature_columns if 'gpu' in col]
time_metrics = [col for col in feature_columns if 'time' in col and 'lwp' not in col]
lwp_metrics = [col for col in feature_columns if 'lwp' in col]
return {
'Power': power_metrics,
'Memory': memory_metrics,
'CPU': cpu_metrics,
'GPU': gpu_metrics,
'Time': time_metrics,
'LWP Time': lwp_metrics
}
# Function to create stack plots by hostname
def create_stack_plot_by_metrics(pivot_df, metric_group_name, metrics, normalize=True, figsize=(14, 8)):
    """
    Create stack plots for a group of related metrics across different hostnames
    """
    # Select data for the metrics
    plot_data = pivot_df[['hostname'] + metrics].copy()
    # Normalize the data if requested
    if normalize:
        for metric in metrics:
            max_val = plot_data[metric].max()
            if max_val > 0:  # Avoid division by zero
                plot_data[metric] = plot_data[metric] / max_val
    # Group by hostname and calculate mean of each metric
    grouped_data = plot_data.groupby('hostname')[metrics].mean()
    # Create the stack plot
    fig, ax = plt.subplots(figsize=figsize)
    # Get hostnames for x-axis
    hostnames = grouped_data.index
    # Prepare data for stacking
    stacked_data = np.zeros((len(metrics), len(hostnames)))
    for i, metric in enumerate(metrics):
        stacked_data[i] = grouped_data[metric].values
    colors = get_combined_colormap(len(metrics))
    # Create stack plot
    ax.stackplot(range(len(hostnames)), stacked_data, labels=metrics, alpha=0.8, colors=colors)
    # Customize the plot
    ax.set_title(f'{metric_group_name} Metrics Across Hostnames {"(Normalized)" if normalize else ""}')
    ax.set_xticks(range(len(hostnames)))
    ax.set_xticklabels(hostnames, rotation=45, ha='right')
    ax.set_xlabel('Hostname')
    ax.set_ylabel('Value')
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.show()
    return fig
# Create multiple stack plots for each metric group
def create_all_stack_plots(pivot_df, normalize=True):
    # Create metric groups
    metric_groups = create_metric_groups(pivot_df.columns)
    # Create a stack plot for each group
    figures = {}
    for group_name, metrics in metric_groups.items():
        if metrics:  # Only if we have metrics in this group
            figures[group_name] = create_stack_plot_by_metrics(
                pivot_df,
                group_name,
                metrics,
                normalize=normalize
            )
    return figures
# Create stack plots with normalized values
stack_plots = create_all_stack_plots(pivot_df, normalize=True)
These charts are very useful for a number of reasons.
Power is a very important metric in the HPC context, especially for systems like Frontier, where scientists must take care not to exceed the power budget. The plot shows a mostly even power workload, except for two nodes that draw slightly more power than the rest.
The next plots reveal the reason for this slight spike in power consumption. The GPU plot is likely driving the power and memory consumption plots, as well as the process plots. At the thread level we see high variance in user and idle time, while the process-level time plots show high utime over time. This too can be explained by GPU usage: hardware threads are either feeding data to the GPU or waiting for the GPU to finish its work, and at the process level this all registers as utime.
Code
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

k = 3
tids = pivot_df['index'].values
feature_cols = X.columns
# scaler (a StandardScaler) and random_state are defined in an earlier cell
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=random_state)
cluster_labels = kmeans.fit_predict(X_scaled)
# Get the top features for each cluster
cluster_centers = pd.DataFrame(
kmeans.cluster_centers_,
columns=feature_cols
)
# Transform the cluster centers back to the original scale
cluster_centers_orig = pd.DataFrame(
scaler.inverse_transform(kmeans.cluster_centers_),
columns=feature_cols
)
# Calculate feature importance for each cluster
feature_variance = cluster_centers.var().sort_values(ascending=False)
# Normalize to get relative importance
feature_importance = feature_variance / feature_variance.sum()
# Use PCA for dimensionality reduction to get better visualization
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)
pc_df = pd.DataFrame(
data=principal_components,
columns=['PC1', 'PC2']
)
print(f"\nPCA Explained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Total Explained Variance: {sum(pca.explained_variance_ratio_).round(4)}")
# See which features contribute most to the principal components
pca_components = pd.DataFrame(
pca.components_.T,
columns=['PC1', 'PC2'],
index=feature_cols
)
print("\nTop 5 features contributing to PC1:")
print(pca_components['PC1'].abs().sort_values(ascending=False).head(5))
print("\nTop 5 features contributing to PC2:")
print(pca_components['PC2'].abs().sort_values(ascending=False).head(5))
# Combine with original data for plotting
pc_df['cluster'] = cluster_labels
# Plot clusters in PCA space
plt.figure(figsize=(12, 10))
# Plot clusters
scatter = plt.scatter(
pc_df['PC1'],
pc_df['PC2'],
c=pc_df['cluster'],
cmap='viridis',
alpha=0.7,
s=50,
edgecolors='w'
)
# Plot cluster centers in PCA space
centers_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(
centers_pca[:, 0],
centers_pca[:, 1],
c='red',
marker='X',
s=200,
alpha=1,
label='Cluster Centers'
)
plt.title(f'K-means Clusters (k={k}) in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
legend1 = plt.legend(*scatter.legend_elements(), title="Clusters")
plt.gca().add_artist(legend1)  # keep the cluster legend when the centers legend is added below
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.show()
hostnames = pivot_df['hostname'].values
# PCA w 2 Components
principal_components = pca.fit_transform(X_scaled)
pc_df = pd.DataFrame(
data=principal_components,
columns=['PC1', 'PC2']
)
# Add hostname and tid information to the PC dataframe instead of cluster labels
pc_df['hostname'] = hostnames
pc_df['tid'] = tids
# Get unique hostnames for coloring
unique_hostnames = pc_df['hostname'].unique()
# Create a color map for the hostnames
import matplotlib.colors as mcolors
from matplotlib.lines import Line2D
# Distinct color list
tab10 = plt.colormaps['tab10']
tab20 = plt.colormaps['tab20']
tab20b = plt.colormaps['tab20b']
tab20c = plt.colormaps['tab20c']
colors = [tab10(i % 10) for i in range(10)]
colors.extend([tab20(i % 20) for i in range(20)])
colors.extend([tab20b(i % 20) for i in range(20)])
colors.extend([tab20c(i % 20) for i in range(20)])
# Assign a color to each hostname
hostname_colors = {hostname: colors[i % len(colors)] for i, hostname in enumerate(unique_hostnames)}
# Plot the PCA with hostname coloring
plt.figure(figsize=(12, 10))
# Plot points colored by hostname
for hostname in unique_hostnames:
    subset = pc_df[pc_df['hostname'] == hostname]
    plt.scatter(
        subset['PC1'],
        subset['PC2'],
        color=hostname_colors[hostname],
        alpha=0.7,
        s=50,
        edgecolors='w',
        label=hostname
    )
plt.title('Data Points in PCA Space by Node (hostname)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
# Create custom legend if there are many hostnames
if len(unique_hostnames) > 10:
    # Create a more compact legend for many hostnames
    legend_elements = [Line2D([0], [0], marker='o', color='w',
                              markerfacecolor=hostname_colors[h], markersize=8, label=h)
                       for h in unique_hostnames]
    plt.legend(handles=legend_elements, loc='best', ncol=2, fontsize='small')
else:
    plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# PCA Display for hostname
print(f"\nPCA Explained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Total Explained Variance: {sum(pca.explained_variance_ratio_).round(4)}")
PCA Explained Variance Ratio: [0.273 0.2222]
Total Explained Variance: 0.4952
Top 5 features contributing to PC1:
name
mem_available_std 0.226308
mem_available_delta_avg 0.224839
memory_utilization 0.224655
mem_available_delta_std 0.223755
gpu_vram_utilization 0.217805
Name: PC1, dtype: float64
Top 5 features contributing to PC2:
name
lwp_stime 0.251363
lwp_majflt_std 0.251276
lwp_nonvol_ctx 0.250003
lwp_nonvol_ctx_std 0.249871
lwp_minflt 0.245866
Name: PC2, dtype: float64
PCA Explained Variance Ratio: [0.273 0.2222]
Total Explained Variance: 0.4952
The first component appears to capture workloads by memory consumption patterns and volatility, while the second captures processor workload characteristics.
We see three clusters. The bottom-left cluster has low PC2 values and negative PC1 values; these may describe processes with higher idle or user time. I suspect idle time, since most of the computation is likely being done on GPUs. The bottom-right cluster has low PC2 values and high PC1 values; these processes use a lot of memory but show little activity overall. In other words, they are memory intensive but do not switch contexts frequently, which may describe processes that handle GPU memory without directly performing computation. The top cluster has high PC2 values with a mix of low and high PC1 values. Here we see heavy system-level interaction, high context switching, and varying memory consumption; these may be processes that manage parallelism, experience resource contention, or interact frequently with the system.
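One way to sanity-check these interpretations is to profile each cluster's mean feature values and pick out the most cluster-distinguishing metrics. A minimal sketch, where `feat_demo` and `cluster_demo` are synthetic stand-ins for the `X` and `cluster_labels` above, and the column names are only a small illustrative subset of the real feature set:

```python
import numpy as np
import pandas as pd

# feat_demo / cluster_demo: synthetic stand-ins for X and cluster_labels
rng = np.random.default_rng(0)
feat_demo = pd.DataFrame(rng.normal(size=(30, 4)),
                         columns=['memory_utilization', 'lwp_stime',
                                  'gpu_vram_utilization', 'lwp_nonvol_ctx'])
cluster_demo = rng.integers(0, 3, size=30)

# Mean feature value per cluster, then z-score each feature across the
# cluster means so the most cluster-distinguishing metrics stand out
profile = feat_demo.groupby(cluster_demo).mean()
z = (profile - profile.mean()) / profile.std()

for cluster, row in z.iterrows():
    print(f"cluster {cluster}: most distinguishing feature = {row.abs().idxmax()}")
```

Substituting the real `X` and `cluster_labels` would show whether, for example, the "GPU memory handler" cluster really is dominated by memory metrics rather than LWP context-switch counts.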
Results and Analysis
Key Findings
The analysis of performance data collected through Zerosum yielded several important insights:
- Raw performance data alone provides limited visibility into application behavior
- Visualization and derived metrics significantly improve application characterization
- A multiple logistic regression model demonstrated high accuracy in workload classification
- Resource utilization patterns can be effectively categorized even with limited training data
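As a rough illustration of the classification setup described here, the sketch below fits a multinomial logistic regression on a synthetic stand-in for the derived-feature matrix; the feature values, labels, and split parameters are placeholders, not the actual study data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the derived-feature matrix and workload labels
rng = np.random.default_rng(42)
n = 60
X_demo = rng.normal(size=(n, 5))
y_demo = rng.integers(0, 3, size=n)   # 0/1/2 standing in for CPU-, memory-, I/O-bound
X_demo[:, 0] += y_demo * 2.0          # make the classes roughly separable

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

scaler = StandardScaler()
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.fit_transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
print(f"test accuracy: {acc:.2f}")
```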
Evidence Supporting Conclusions
The strong performance of the multiple logistic regression model suggests that, given a larger labeled dataset, we could accurately classify application bounds for a variety of different tasks and faults.
- Visualization techniques revealed distinct patterns in resource utilization that were not apparent in raw metrics
- Calculated features (means, standard deviations, deltas) provided discriminative power without relying on raw time-series data
- Classification models successfully differentiated between:
- CPU-bound processes
- Memory-bound operations
- I/O-bound processes
- Deadlocked scenarios
- Unbalanced workloads
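The mean/standard-deviation/delta features mentioned above can be derived from the raw 10-second samples along these lines; the DataFrame layout and column names here are hypothetical stand-ins for the real Zerosum output:

```python
import numpy as np
import pandas as pd

# Hypothetical raw time series: one row per 10-second Zerosum sample,
# two threads (tids) with a cumulative time counter and a memory gauge
rng = np.random.default_rng(1)
raw_demo = pd.DataFrame({
    'tid': np.repeat([101, 102], 50),
    'user_time': rng.normal(50, 5, 100).cumsum() / 10,
    'mem_available': rng.normal(8000, 200, 100),
})

def summarize(group):
    """Collapse one thread's time series into mean/std/delta-style features."""
    deltas = group.diff().dropna()  # per-interval changes
    return pd.Series({
        'user_time_std': group['user_time'].std(),
        'user_time_delta_avg': deltas['user_time'].mean(),
        'user_time_delta_std': deltas['user_time'].std(),
        'mem_available_std': group['mem_available'].std(),
        'mem_available_delta_avg': deltas['mem_available'].mean(),
        'mem_available_delta_std': deltas['mem_available'].std(),
    })

features = raw_demo.groupby('tid')[['user_time', 'mem_available']].apply(summarize)
print(features)
```

The delta features matter for cumulative counters like utime, where the raw value only grows; the per-interval change is what distinguishes a busy thread from an idle one.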
Practical Applications
This analysis opens several promising avenues for practical application:
- User-Friendly Resource Analysis: The ability to provide descriptive analysis to users who are not familiar with resource metrics
- Resource Scheduling Optimization: Potential for making evidence-based suggestions for batched job scheduling
- Automated Performance Troubleshooting: Identification of resource contention and deadlocks without manual inspection
Limitations and Challenges
Several factors limit the generalizability of the current analysis:
| Limitation | Description | Impact |
|---|---|---|
| Dataset Size | Small number of samples | Potential overfitting, limited variety of workload patterns |
| Workload Diversity | Limited variety of tasks | May not capture full spectrum of real-world applications |
| Memory Metrics | Shared memory property for each hardware thread | Likely inflated model accuracy due to only 10 samples |
| Time Constraints | Focus on analysis over dataset generation | Prevented creation of large, diverse, balanced dataset |
Future Work
To address the current limitations and extend this project, I would propose:
Additional Data Collection
- Run diverse benchmark applications with Zerosum
- Benchmarks are widely accepted and already labeled
- Would provide standardized comparison points
Expanded Execution Environment
- Deploy test workloads on a cluster environment
- Generate models based on both HWT and LWP observations
- Test across heterogeneous hardware configurations
Advanced Analytical Approaches
- Implement time series analysis to identify trends and predict utilization
- Explore unsupervised learning approaches for anomaly detection
- Develop transfer learning techniques to adapt models across different systems
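As a sketch of the unsupervised anomaly-detection direction, an isolation forest over per-node summary features could flag outlier nodes such as the higher-power nodes observed earlier; the data below is synthetic and the model parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic node-level feature matrix: most nodes behave alike, one outlier
# standing in for a node with an anomalous power/utilization profile
rng = np.random.default_rng(7)
node_feats = rng.normal(0, 1, size=(40, 6))
node_feats[0] += 8.0   # an anomalous node, e.g. a sustained power spike

iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
labels = iso.fit_predict(node_feats)   # -1 = anomaly, 1 = normal
print("anomalous rows:", np.where(labels == -1)[0])
```

Because no labels are required, this kind of model could run continuously against live Zerosum samples and surface misbehaving nodes without a training campaign.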
End-User Tool Development
- Create intuitive visualizations for non-expert users
- Develop automated recommendation systems for resource allocation
- Integrate with existing job scheduling systems (like Argobots or IRIS)
Impact
The impact of these methodologies would be better in situ application utilization visibility for those interested in performance on heterogeneous devices. While Zerosum does the difficult work of tapping into sources, consolidating the data, analyzing configurability issues, and producing results, the visualization of those results is limited. With further exploration, I hope my results can produce a more user-friendly utilization analysis. If my analysis is incorrect, however, my models could cause more confusion about resource utilization and likely discourage users from adopting resource monitoring.